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ABSTRACT 


With the advance of modern technology, more and more data are being recorded continuously 
during a time interval or intermittently at several discrete time points. They are both examples 
of “functional data”, which have become a prevailing type of data. Functional Data Analysis 
(FDA) encompasses the statistical methodology for such data. Broadly interpreted, FDA deals 
with the analysis and theory of data that are in the form of functions. This paper provides an 
overview of FDA, starting with simple statistical notions such as mean and covariance functions, 
then covering some core techniques, the most popular of which is Functional Principal Component 
Analysis (FPCA). FPCA is an important dimension reduction tool and in sparse data situations 
can be used to impute functional data that are sparsely observed. Other dimension reduction 
approaches are also discussed. In addition, we review another core technique, functional linear 
regression, as well as clustering and classihcation of functional data. Beyond linear and single or 
multiple index methods we touch upon a few nonlinear approaches that are promising for certain 
applications. They include additive and other nonlinear functional regression models, such as time 
warping, manifold learning, and dynamic modeling with empirical differential equations. The paper 
concludes with a brief discussion of future directions. 

KEY WORDS: Functional principal component analysis, functional correlation, functional linear 
regression, functional additive model, clustering and classihcation, time warping 


1 Introduction 


Functional data analysis (FDA) deals with the analysis and theory of data that are in the form of 
functions, images and shapes, or more general objects. The atom of functional data is a function, 

j d. Wh ile the term 
the history 


“functional data analysis” was coined by 

Ramsav 

(1982 

and 

Ramsav Sz Dalzell 

of this area is much older and dates back to 

Grenander 

(1950 

and 

Raol ( 
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intrinsically inhnite dimensional. The high intrinsic dimensionality of these data poses challenges 
both for theory and computation, where these challenges vary with how the functional data were 
sampled. On the other hand, the high or inhnite dimensional structure of the data is a rich source 
of information, which brings many opportunities. 

First generation functional data typically consist of a random sample of independent real-valued 
functions, Xi{t),, Xn (t), on a cornpact interval I = [O.T] on the real line. Such data h ave also 


ip 

,Ti 


been termed curve data (iGasser et al.l . 11984 iRice fc Silvermanl . Il99ll : iGasser fc Kneipl . 119951 ) . These 
real-valued functions can be viewed as the realizations of a one-dimensional stochastic process, often 
assumed to be in a Hilbert space, such as T^(J). Here a stochastic process X(t) is said to be an 

process if and only if it satishes B(fj X‘^(t)dt) < oo. While it is possible to model functional 
data with parametric approaches, usually mixed effects nonlinear models, the massive information 
contained in the inhnite dimensional data and the need for a large degree of hexibility, combined with 
a natural ordering (in time) within a curve datum facilitate non- and semi-parametric approaches, 
which are the prevailing methods in the literature as well as the focus of this paper. Smoothness of 
the individual function (or stochastic process), such as existence of continuous second derivatives, is 
often imposed for regularization, which is especially useful if nonparametric smoothing techniques 
are employed, as is prevalent in functional data analysis 

In this paper, we focus on hrst generation functional data with brief a discussion of next gener¬ 
ation functional data in Section 6. Here next generation functional data refers to functional data 
that are part of complex data objects, and possibly are multivariate, correlated, or involve images or 
shapes. Examples of next generation functional data include brain and neuroimaging data. A sepa¬ 
rate entry on functional data approaches for neuroimaging data is available at the same issue of the 
Annual Reviews (Link to John Aston’s contribution). For a brief discussion of next generation func¬ 
tional data, see page 23 of a report (http://www.worldofstatistics.org/wos/pdfs/Statistics&Science- 
TheLondonWorkshopReport.pdf) of the London workshop on the Future of Statistical Sciences held 
in November 2013. 

Although scientihc interest is in the underlying stochastic process and its properties, in reality 
this process is often latent and cannot be observed directly, as data can only be collected discretely 
over time, either on a fixed or random time grid. The time grid can be dense, sparse, or neither; 
and may vary from subject to subject. Originally, functional data were regarded as samples of fully 
observed trajectories. A slightly more general assumption is that functional data are recorded on 
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the same dense time grid ti,... ,tp for all n subjects. If the recording is done by an instrument, 
such as EEG or fMRI machine, the time grid is usually equally spaced, that is fj+i — ti = tj+i — tj 
for all i and j. In asymptotic analysis, the spacing tj+i — tj is assumed to approach zero as n 
tends to inhnity, hence p = pn is a. sequence that tends to inhnity. On one hand, large p leads 
to a high-dimensional problem, but also means more data so should be a blessing rather than a 
curse. This blessing is realized by imposing a smoothness assumption on the processes, so that 
information from measurements at neighboring time points can be pooled to overcome the curse of 
dimensionality. Thus, smoothing serves as a tool for regularization. 

While there is no formal dehnition of “dense” functional data, the convention has been that Pn 
has to converge to inhnity fast enough to allow the corresponding estimate for the mean function 
p{t) = EX{t), where X is the underlying process, to attain the parametric ^/n convergence rate for 
standard metrics, such as the norm. Sparse functional data arise in longitudinal studies where 
subjects are measured at different time points and the number of measurements n* for subject i may 
be bounded away from inhnity, i.e., supi<j<„nj < G < oo for some constant C. A ri gorous dehnition 
of th e types of functional data based on their sampling plans is still lacking, see fjZhang fc Wang . 


2014) for a possible approach, with further details in Section 2 below. 


In reality, the observed data often are contaminated by random noise, referred to as measure¬ 
ment errors, which are often assumed to be independent across and within subjects. Measurement 
errors can be viewed as random huctuations around a smooth trajectory, or as actual errors in the 
measurements. A strength of FDA is that it can accommodate measurement errors easily because 
for each subject one observes repeated measurements. An interesting, but perhaps not surprising, 
phenomenon in FDA is that the methodology and theory, such as convergence rates, varies with the 
sampling plan of the time grid, i,e,, the measurement schedule. Intriguingly, sparse and irregularly 
sampled functional data, that we synonymously refer to as longitudinal data, typically require more 
effort in theory and methodology as compared to densely sampled functional data. Functional data 
that are, or assumed to be, observed continuously without errors are the easiest type to handle 
as theory for stochastic processes, such as functional laws of large numbers and functional central 
limit theorems, are readily applicable. A comparison of the various approaches will be presented in 
Section 2, with discussion of a unihed approach for various sampling plans. 

One challenge in functional data analysis is the inverse nature of functional regression and most 
functional correlation measures. This is triggered by the compactness of the covariance operator, 
which leads to unbounded inverse operators. This challenge will be discussed further in Section 3, 
where extensions of classical linear and generalized linear models to functional linear and generalized 
functional linear models will be reviewed. Since functional data are intrinsically inhnite dimensional, 
dimension reduction is key for data modeling and analysis. The principal component approach will 
be explored in Section 2 while several approaches for dimension reduction regression will be discussed 
in Section 3. 

Clustering and classihcation of functional data are useful and important tools in FDA with 
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wide ranging applications. Methods include extensions of classical /c-means and hierarchical clus¬ 
tering, Bayesian and model-based approaches to clustering, as well as functional regression based 
and functional discriminant analysis approaches to classihcation. These topics will be explored in 
Section 01 

The classical methods for functional data analysis have been predominantly linear, such as 
functional principal components or the functional linear model. As more and more functional data 
are being generated, it has emerged that many such data have inherent nonlinear features that make 
linear methods less effective. Sections 5 reviews some nonlinear approaches to FDA, including time 
warping, non-linear manifold modeling, and nonlinear differential equations to model the empirical 
dynamics inherent in functional data. 

A well-known and well-studied nonlinear effect is time warping, where in addition to the common 
amplitude variation one also considers time variation. This creates a basic non-identihability prob¬ 
lem. Section 5.1 will provide a discussion of these foundational issues. A more general approach 
to model nonlinearity in functional data that extends beyond time warping and includes many 
other nonlinear features that may be present in longitudinal data is to assume that the functional 
data he on a nonlinear (Hilbert ) manifold. The startin g point for such models is the choice of a 
suitable distance and ISOMAP ( Tenenba.nm et ah . boOO ) or related methods can then be employed 
to uncover the manifold structure and dehne functional manifold means and components. These 
approaches will be described in Section 5.2. Modeling of time-dynamic systems with differential 
equations that are learned from many realizations of the trajectories of the underlying stochastic 
process and the learning of nonlinear empirical dynamics such as dynamic regression to the mean 
or dynamic explosivity is briefly reviewed in Section 5.3. 

Section 6 concludes this review with a brief outlook on the future of functional data analysis, 
where the emphasis shifts to next generation functional data. 

Research tools that are useful for handing functional data include various smoothing methods, 

ent reference books 


notab ly kernel, local least s c 


exist ( Wand fc Jone^ 


1995 


uares and spline smoothing for which various excel 


Fan fc Giibelsl. 


. Il994 : 


1996 


Fnbank. 


Riesz fc Sz 


-NagvI. 


1999 


de Boon. 1200 ll) and knowledge on 


19901). Several software packages are publicly 


functional analysis flConwav 
available to analyze functional data, including software at the Functional Data Analysis website of 
James Ramsay (http://www.psych.mcgill.ca/misc/fda/), the fda package on the crane project of R 
(http: / / cran. r-pro j ect. org / web / packages / fda / fda. pdf), 

the Matlab package PACE on the website of the Statistics Department of the University of Califor¬ 
nia, Davis (http://www.stat.ucdavis.edu/PACE/), and the R package refund on functional regres¬ 
sion (http://cran.r-project.org/web/packages/refund/index.html). 

This review is based on a subjective selection of topics in FDA that the authors have worked on or 
hnd of particular interest. We do not attempt to provide an objective or comprehensive review of this 
fast moving held and apologize in advance for any omissions of relevant work. By now there are many 
alternative approaches to handle functional data. Interested readers can explore the various aspects 
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several monographs ( 
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, 2005; 

Ferratv & Vieu. 

2006: Wu & Zhang. 
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Ramsav et ah 
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, 2012; 
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and review articles i 

Ricel. 

2004 

Zhao et ah 
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2005 

2008: 
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2006 
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special journal issues were devoted to FDA including a 2004 issue of Statistica Sinica (issue 3), a 
2007 issue in Computational Statistics and Data Analysis (issue 3), and a 2010 issue in Journal of 
Multivariate analysis (issue 2). 


2 Mean and Covariance Function, and Functional Principal 
Component Analysis 

In this section, we focus on first generation functional data that are i.i.d. realizations of a stochastic 
process X that is in and dehned on the interval I with mean function fi{t) = E{X{t)) and 
covariance function S(s,t) = cov(X(s),X(t)). The functional framework can also be extended to 

processes with multivariate arguments. The realization of the process for the Ah subject is 
Xi = Xi{-), and the sample consists of n subjects. For generality, we allow the sampling schedules 
to vary across subjects and denote the sampling schedule for subject i as and the 

corresponding observations as X* = {Xu ,..., where Xij = Xi{tij). In addition, we allow the 

measurement of Xij to be contaminated by a random noise Cij with E{eij) = 0 and var(ejj) = 
so the actual observed value is Yij = X^ + e^j, where Cij are independent across i and j and often 
termed “measurement errors”. 

It is often assumed that the errors are homoscedastic with afj = but this is is not strictly 
necessary, as long as afj = var(e(tij)) can be regarded as the discretization of a smooth variance 
function cr^(t). We observe that measurement errors are realized only at those time points tij 
where measurements are being taken. Hence these errors do not form a stochastic process e{t) but 
rather should be treated as discretized data Cij. However, in order to estimate the variance afj of 
Cij it is often convenient to assume that there is a latent smooth function a{t) such that aij = a^{tij). 

Estimation of Mean and Covariance Functions. When subjects are sampled at the same time 
schedule, i.e., t^j = tj and Ui = m for all i, the observed data are m-dimensional multivariate data, 
so the mean and covariance can be estimated empirically at the measurement times by the sample 
mean and sample covariance, fi{tj) = ^ and S(4,t/) = ^ Yfi=i(Xik - K'tik)){yii - KUi)), 

for k ^ 1. Missing data (missing completely at random) can be handled easily by adjusting the 
available sample size at each time point tj for the mean estimate or by adjusting the sample sizes 
of available pairs at {tk,ti) for the covariance estimate. An estimate of the mean and covariance 
functions on the entire interval I can then be obtained by smooth interpolation of the corresponding 
sample estimates or by mildly smoothing over the grid points. Such a smoothing step enhances the 
global estimate of the mean and auto-covariance functions and consistency can be attained only if 
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m = rrin grows with the sample size and approaches inhnity. Once we have a smoothed estimate S 
of the covariance function S, the variance of the measurement error at time tj can be estimated as 
^ Er=i(^o - because var(y(f)) = var(X(t)) + a^{t). 

When the sampling schedule of subjects differs, the above sample estimates cannot be obtained. 
However, one can borrow information from neighboring data and across all subjects to estimate the 
mean function, provided the sampling design combining all subjects, i.e. l<j< 

Hi}, is a d ense subset of t he int erval I. Then a nonparametric smoother, such as a local polynomial 
estimate flFan fc Giibelsl . Il996l) . can be applied to the scatter plot {{Yij, tij) : i = 1,... ,n, and j = 
1,... ,nj} to smooth against tij across time and will yield consistent estimates of fi{t) for all 
t. Likewise, the covariance can be estimated on / x / by a two-dimensional scatter plot smoother 
{{uiki, Uk, til) ■ i = I,... ,n-, k,l = 1,... ,ni, k 1} to smooth Uiki against {Uk, Ui) across the two 
dimensional product time intervals, where um = {Yik — ft(tik))(Yii — ftitu)) are the raw covariances. 
We note that the diagonal raw covariances where k = I are removed from the 2D scatter plot 
prior to the smoothing step because these include an additional term that is due to the variance 
of the measurement errors in the observed Yij. Indeed, once an estimate S for S is obtained, the 
variance cr’^it) of the measurement errors can be obtained by smoothing Yij — jlitij)"^ — 'E(tij) against 
tii across t ime. A better estimate for a under the homoscedasticity assumption is discussed in 
fennhali . 


Yao et al. 


The above smoothing approach is based on a scatter plot smoother which assigns equal weights 
to each observation, therefore subjects with a larger number of repeated observations receive more 


total weight, and hence contribute mo re toward t 


le estimates of the mean and covariance functions. 


An alternative approach employed in lLi &: Hsind (120101) is to assign equal weights to each subject. 


Both approaches are sensible. A question is which one would be preferred for a particular design 
and whether there is a unihed way to deal with these two m ethods and their theory. These issues 
were recently explored in a manuscript fjZhang Sz Wand. I2014J ). employing a general weight function 
and providing a comprehensive analysis of the asymptotic properties on a unihed platform for three 
types of asymptotics, and L°° (uniform) convergence as well as asymptotic normality of the 
general weighted estimates. Functional data sampling designs are further partitioned into three 
categories, non-dense (designs where one cannot attain the ^/n rate), dense (where one can attain 
the y/n rate but with a non-neglible asymptotic bias), and ultra-dense (where one can attain the 
y/n rate without asymptotic bias). Sparse sampling scenarios where Ui is uniformly bounded by a 
hnite constant are a special case of non-dense data and lead to the slowest convergence rates. These 
designs are also referred to as longitudinal designs. The differences in the convergence rates also 
have ramihcations for the construction of simultaneous conhdence bands. For ultra dense or some 
dense functional data, the weighing scheme that assigns equal weights to subjects is generally more 
efficient than the scheme that assigns equal weight per observation but the situation is reversed for 
many other sampling plans, including sparse functional data. 
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Hypothesis Testing and Simultaneous Confidence Bands for Mean and Covariance 
Functions . Hyp othesis testing for the comparison of mean functions p is of obvious interest. 


further work by 

Guevas et al. 

(2004 

) and 

Zhane 

(2013 

). Other two sample tests were studied 

for distributions of functional data ( 

dall & Van Keileeom. 2007) and for the covariance functions 

fPanaretos et ah. 

2010: Boente et al. 

2011 

• 


Another infer ence pro 
band s for dense (jPegrasl . 


jlem t hat has been explored is the construction of s imultaneous confidence 


2008 


, I2OIII: IWang fc Yand. 12009 : 


Cao et ah 


20121) and sparse llMa et ah 


20121) functional data. However, the problem has not been completely resolved for functional data 


due to two main obstacles, the infinite dimensionality of the data and the nonparametric nature 
of the target function. For the mean function p, an interesting “phase transition” phenomenon 
emerges: For ultra-dense data the estimated mean process — fi{t)) converges to a mean 

zero Gaussian process W(t), for t E I, so standard continuous mapping leads to a construction of 
a simultaneous confidence band based on the distribution of supjhF(t). When the functional data 
are dense but not ultra dense, the process can still converge to a Gaussian process 

W{t) with a proper choice of smoothing parameter but W is no longer centered at zero due to the 
existence of asymptotic bias as discussed in Section 1. 

This resembles the classical situation of estimating a regression function, say rn{t), based on 
independent scalar response data, where there is a trade off between the bias and variance so 
optimally smoothed estimates of the regression function will have an asymptotic bias. The con¬ 
ventional approach to construct a pointwise confidence interval is based on the distribution of 
r„(m(t) — E{m{t))), where m[t) is an estimate of m{t) at the optimal rate r„. This means that 
the asymptotic confidence interval derived from it is targeting E{m{t)) rather than the true target 
m{t) and therefore is not really viable for inference. 

In summary, the construction of simultaneous confidence band for functional data requires dif¬ 
ferent methods for ultra-dense, dense, and sparse functional data, whe r e in t he latter case one does 


not have tightness and the rescaling approach of 


Bickel Sz Rosenblatt 


f I 973 I) may be applied. The 


divide between the various sampling designs is perhaps not unexpected since ultra dense functional 
data falls along the paradigm of parametric inference where the y/n rate of convergence is attained 
with no asymptotic bias, while dense functional data attains the parametric rate of y/n convergence 
albeit with an asymptotic bias, which leads to challenges even in the construction of pointwise con¬ 
fidence intervals. Unless the bias is estimated separately, removed from the limiting distribution, 
and proper asymptotic theory is established, which usually requires regularity conditions for which 
the estimators are not efficient, the resulting confidence intervals need to be taken with a grain 
of salt. This issue is specific to the bias-variance trade off that is inherited from nonparametric 
smoothing. Sparse functional data follow a very different paradigm as they allow no more than 
nonparametric convergence rates, which are slower than y/n, and the rates depend on the design of 
the measurement schedule and properties of mean and covariance function as well as the smoother 
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fjZhang Sz Wand . l2014f) . The phenomenon of nonparametric versus parametric convergence rates 


as designs g e t mor e regular and denser characterize a sharp “phase transition” (iHall et ah 
Cai fc Yuan . 2011 ). 


2006 


Func tional Principal Component Analysis (FPCA). Principal component analysis fjjolliffe . 
2002 h is a key dimension reduction tool for multivariate data that has been extended to functional 
data and ter med functional pr i ncipal component a nalysis (FPC A). Althou gh the basic ideas were 


conceived in 


Grenanderl fll950h : iKarhiinenl fjl946l) : 


Loeve 


fll946l) 


and 


Rad fll958l) . a more compre¬ 


hensive framework for statistical inference for FPCA was h rst developed in a jo int Ph.D. thesis of 
Dauxois and Pousse (1976) at the University of Toulouse (IDauxois et al.l. 119821) . Since then, this 
approach has taken off to become the most prevalent tool in FDA. This is partly because FPCA facil¬ 
itates the conversion of inherently inhnite-dimensional functional data to a hnite-dimensional vector 
of random scores. Under mild assumptions, the underlying stochastic process can be expressed as a 
countable sequence of uncorrelated random variables, the functional principal components (FPCs) 
or scores, which are then truncated to a hnite vector. Then the tools of multivariate data analysis 
can be readily applied to the resulting random vector of scores, thus accomplishing the goal of 
dimension reduction. 

Specihcally, the dimension reduction is achieved through an expansion of the underlying but 
often not fully observed random trajectories Xi{t) in a functional basis that consists of the eigen¬ 
functions of the auto-covariance operator of the process X. With a slight abuse of notation we 
dehne the covariance operator as ^(^f) = Jj'E{s,t)g{s)ds, for any function g G T^, using the same 
notation for the covariance operator and covariance function. Because of the int egral for i n. the co- 
variance operator is a trace class and hence compact Hilbert-Schmidt operator fjConwavl . 119941) . It 
also has real-valued nonnegative eigenvalues Xj, because it is symmetric and non-negative dehnite. 
Under mild assumptions, Mercer’s theorem implies that the spectral decomposition of S leads to 
S(s, t) = ^k<Pk{s)(f)k(t), with uniform convergence, where Afc are the eigenvalues (in descending 

order) of t he covariance operator and (j)k the corresponding orthogonal eigenfunctions. Karhunen 


and Loeve flKarhunenl . 


1946 


Loevel . Il946l ) independently discovered the FPCA expansion 

OO 

Xi{t) = fi{t) + 


( 1 ) 


k=l 


where Aik = fj{Xi(t) — g.(t))cl)k(t)dt are the functional principal components (FPCs) of A*. The 
Aik are independent across i for a sample of independent trajectories and are uncorrelated across k 
with E{Aik) = Ovar(Ajfc) = A^. The convergence of the sum in ([T]) is with respect to the norm. 
Expansion ([1]) facilitates dimension reduction as the hrst K terms for large enough K provide a 
good approximation to the inhnite sum and therefore for A,, so that the information contained in 
Aj is essentially contained in the A-dimensional vector Aj = (A^i,..., Ai^) and one works with the 





































approximated processes 


K 


XiK{t) = fi{t) + y^^Aik4>k{t), 


( 2 ) 


k=l 


Analogous dimension reduction can be achieved by expanding the functional data into other 
function bases, such as spline, Fourier, or wavelet bases. What distinguishes FPCA is that for 
a given number of K components the expansion that uses these components explains most of 
the variation in X in the sense. When choosing K in an estimation setting, there is a trade 
off between bias (which gets smaller as K increases due to the smaller approximation error) and 
variance (which increases with K as more components must be estimated, adding random error). 
So a model selection procedure is needed, where typically K = Kn is considered to be a function of 
sample size n and must tend to inhnity to obtain consistency of the representation. Because of 
this, the theory for FPCA is quite different from standard multivariate analysis theory. 

The estimation of the eigencomponents (eigenfunctions and eigenvalues) of FPCA is straight¬ 
forward once the mean and covariance of the functional data have been obtained. To obtain the 
spectral decomposition of the covariance operator, which yields the eigencomponents, one simply 
approximates the estimated auto-covariance surface cov(W(s), X(f)) on a hnite grid, thus reducing 
the problem to the corresponding matrix spectral decomposition. The convergence of the estimated 
eigencomponents is obtained by combining results on the convergence of the covariance esti mates 
that a re achieved under regularity conditions with perturbation theory (see Chapter VIII of iKa 

JlQSnlB . 

For situations where the covariance surface cannot be estimated at the ^/n rate, the convergence 
of estimates is typically influenced by the smoothing method that is employed. Consider the sparse 
case, where the convergence rate of the covariance surface corresponds to the optimal rate at which a 
smooth two-dimensional surface can be estimated. Intuition suggests that the eigenfunction, which 
is a one-dimensional function, should be estim able at the one-dim ensional optimal rate for smoothing 

(1200611 , where eigenfunction estimates were 


Hall et ah 


methods. An affirmative answer is provided in 
shown to attain the better (one-dimensional) rate of convergence, if one is undersmoothing the 
covariance surface estimate. This phenomenon resembles a scenario encountered in semiparametric 
inference, where a y/n rate is attainable for the parametric component if one undersmooths the 
nonpararmetric component before estimating the parametric component. This undersmoothing can 
be avoided so that the same smoothing parameter can be employed for both the parametric and 
nonparametric component if a prohle approach is employed to estimate the parametric component. 
An interesting and still open question is how to construct such a prohle approach so that the 
eigenfunction is the direct target of the estimation procedure, bypassing the estimation of the 
covariance function. 

Another open question is the choice of the number of components K needed for the approxi¬ 
mation (j2]) of the full Karhunen-Loeve expansion ([1]) for various applications of FPCA. There are 
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several ad hoc procedures that are routinely applied in multivariate PCA, such as the scree plot or 
the fraction of variance explained by the hrst few PC components, which can be directly extended to 
the functional setting. Other approaches ar e pseudo-v e rsions of AIC (Akaike information criterion) 
and BIG (Bayesian information criterion) flYao et ahl. l2005a|) . where the latter selects fewer com¬ 
pone nts. Cross-validation with one-curve-leave-out has also been investitgated (iRice fc SilvermaiJ . 
199ll) . but tends to overht functional data by selecting to o large K in (121). Additional work for model 


selection is needed, where an initial analysis is due to iHall fc Viall fl2006l l. who studied a special 
case where the functional data are of hnite dimension, i.e. the total number of components in ([1]) is 
a hnite integer rather than cx). In addition to the order K one also needs to choose various tuning 
parameters for the smoothing steps and their optimal selection in the context of FDA remains a 
challenge due to the auto-correlations within a subject. 


FP CA for fully obser ve d fun c tiona 




Silverman 


(Ii996li . IbosqI (l2nnnl) : 


data was studied in 


Boente fc Fraiman 


Dauxois et al 


FPCA for densely observed function al dat a was explored in 

(I1991I1 


); 

Pezzulli & Silverman 

(1993 

) and 

Gardot 

(2000) 


(l2nnnh: 


^ 198211 . 


Besse fc Ramsav 


Hall fc Hosseini-Nasab 


Castro et al 


(119861); 


(I2nn6h. 


Rice fc Silverman 


For the much more difficult but com- 


m only encountered situation of spars e func ti onal data, t h e FP C A approach was in vestigated 


in 

Shi et al. 

(1996 


Staniswalis & Lee 

(1998) 

(2 

005a) 

Yao & Le( 


2006 

); and 

Paul & 

Peng 


to incorporate covariates flChiou et al. 


2003 


James et al. 


(l2000h : iRice fc Wnl fl200lh : 


Yao et al. 


2 OO 9 II. The FPCA approach has also been extended 


Cardot 


2 OO 7 I: IChiou fc Miillerl . 120091) for vector co¬ 


variates an d dense functional data, an d also for sparse functional data with vector or functional 


covariates fjjiang fc Wand . 


2010 


20 111 ) . 


The aforementioned approaches of FPCA are not robust against outliers because principal com¬ 
ponent analysis involves second order moments. Outliers for functional data have many different 
facets due to the high dimensionality of these data. They can appear as outlying measurements at 
a single or several time points, or as an outlying shape of an entire function. Current approaches 
to deal with outliers and conta mination and more g enera l ly visual explor a tion o f functional data 
include explorat ory box plots 


sions of FPCA fICrambes e 


; al 


Hvndman fc S 


2008 


rang 


Gervini 


2008 


2010 : 


Sun fc Genton. 


Bah et al 


2011 


2011) and robust ver- 


Kraus fc Panaretosl . 


2012 


20141) . More research on outlier detection and robust FDA approaches 


Boente fc Salibian-Barreral. 

are needed. 

Applications o f FPCA. The FP GA approach motivates the concept of modes of variation for 
functional data ( Jones fc Ricel. 1992 ). a most useful tool to visualize and describe the variation in 
the functional data that is contributed by each eigenfunction. The k—i\i mode of variation is the 
set of functions 

p(f) ± a0^(/)fc(f), tel, ae [-A, A], 

that are viewed simultaneously over the range of a, usually for A = 2, substituting esti¬ 
mates for the unknown quantities. Often the eigencomponents and associated modes of varia- 
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tion have co mpelling and sometimes strik ing interpretations, such as for 


tional traits flKirkpatrick fcHeckmanl . 1198911 and in many other applications flKneip fc Utikah 


he evolution o: 


func- 


2001 


Ramsay fc Silverman! . l2002h . FPCA also facilitates functional principal component regression by 
mapping the function to its hrst few principal components, then employing regression models with 
vector predictors. Since FPCA is an essential dimension reduction tool, it is also useful for classih- 
cation and clustering of functional data (see Sections 3). 

Last but not least, FPCA facilitates the construction of parametric models that will be more 
parsimonious. For instance, if the hrst two principal components explain over 90% of the variation 
of the data, then one can approximate the original functional data with only two terms in the 
Karhunen-Loeve expansion ([T]). If in addition that the hrst eigenfunction is nearly linear in time 
and explains over 80% of the total variation of the data, then a parametric linear mixed-ehects 
model with a linear time trend for random ehects likely will ht this data well. This underscores the 
advantages to use a nonparametric approach such as FDA prior to a model-based longitudinal data 
analysis for data exploration. The exploratory analysis then may suggest viable parametric models 
that are more parsimonious than FPCA. 


3 Correlation and Regression: Inverse Problems and Di¬ 
mension Reduction for Functional Data 


As mentioned in Section 1, a major challenge in FDA is the inverse problem, which stems from 
the compactness of the covariance operator. Consider S = cov{X{s),X(t)) with eigenvalues A^, 
k = 1,... ,oo. If there are only hnitely many, say K, positive eigenvalues, the functional data 
become hnite dimensional and can be fully described by the iC-dimensional principal component 
scores (in addition to the mean function and the K eigenfunctions). In this case, the functional 
data may be viewed as iC-dimensional multivariate data and the covariance matrix is equivalent 
(or isomorphic) to a K by K invertible matrix. Then the same considerations as for multivariate 
data apply and the inverse problem is equivalent to that of multivariate data with no additional 
complications for functional data. 

If on the other hand that there are inhnitely many nonzero, hence positive, eigenvalues, then 
the covariance operator is a one-to one function. In this case, the inverse operator of S exists 
but is an unbounded operator and the range space of the covariance operator is a compact set 
in L^. This creates a problem to dehne a bijection, as the inverse of S is not dehned on the 
entire space. Therefore regularization is routinely adopted for any procedure that involves an 
inverse operator. Examples where inverse operators are central include regression and correlation 
measures for functional data, as appears in these m 
example addressed for functional canonical correlation in 


He et ah 

(2000 

) and 

He et ah 

120031. 


where a solution was proposed under certain constraints on the decay rate of the eigenvalues and 
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the cross covariance operator. 


3.1 Functional Correlation 

Different fnnctional correlation measures have been discussed in the literature. Functional Canonical 
Correlation Analysis serves here to demonstrate some of the problems that one encounters in FDA 
as a consequence of the non-invertibility of compact operators. 

Functional Canonical Correlation Analysis (FCCA). Let {X,Y) be a pair of random 
functions in L‘^{Ix) and L‘^{Iy) respectively. The first functional canonical correlation coeffi¬ 
cient Pi and its associated weight functions are defined as follows, using the notation 

(/i, / 2 ) = Ji fi{t)f 2 {t)dt for any /i, /s G 

pi= sup cov{{u,X),{v,Y)) = cov{{ui,X),{vi,Y)), (3) 

ueL^iix),veL^(iY) 

subject to var((M, X)) = 1 and var((n, F)) = 1. Analogously for the kth, k > 1, canonical correlation 
Pk and its associated weight functions {uk,Vk), 

pk= sup cov{{u,X),{v,Y)) = cov{{uk,X),{vk,Y)), (4) 

ueL^(ix),veL^iiY) 

subject to var((M,X)) = 1, var((n,F)) = 1, and that the pair {Uk,Vk) = {{uk, X), {vk,Y)) is 
uncorrelated to all previous pairs {Uj, Vj) = ((wj, X), {vj, F)), for j = 1,..., /c — 1. 

Thus, FCCA aims at finding projections in directions Uk of X and Vk of F such that their 
linear combinations (inner products) Uk and Vk are maximally correlated, resulting in the series of 
functional canonical components {pk, Uk, Vk, Uk, Vk), k > 1, directly extending canonical correlations 
for multivariate data. Because of the flexibility in the direction ui, which is infinite dimensional, 
over fitting may occur if the number of sample curves is not large enough. Formally, this is due 
to the fact that FCCA is an ill-posed problem. Introducing the cross-covariance operator Sxy : 
L\Iy) ^ L^Ix), 

ExYv{t) = J cov (X(t),Y(s))v(s)ds, (5) 

for V G T^(/y) and analogously the covariance operators for X, Exx, for F, Eyy, and using 
cov((m,X), {v,Y)) = {u, ExyX), the kth canonical component in (j4]) can be expressed as 

Pk= sup {u,Rxyv) = {uk,RxYVk)- (6) 

Then ([6]) is equivalent to an eigenanalysis of the operator R = E^^^ExyEy^y"^ ■ Existence of 
the canonical components is guaranteed if the operator R is compact. However, the inverse of a 
covariance operator and the inverses of or are not bounded since a covariance operator 
is compact under the assumption that the covariance function is square integrable. A possible 
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approach ( He et ah . 2003 ) is to restrict the domain of the inverse to the range Ax of so 


that the inverse of ^xx dehned o n Ay anc 


is a b ijective mapping Ax to Bx, under some 
conditions (e.g., Conditions 4.1 and 4.5 in He et ah ( 20031) ) on the decay rates of the eigenvalues of 
T,xx and Syy and the cross-covariance. Under those assumptions the canonical correlations and 
weight functions are well dehned and exist. 

An alternative way to get around the above ill-posed problem is to restrict the maximization in 
(j3]) and (jl)) to discrete spaces tha t are restricted to a rep roducing kernel Hilbert space instead of 


working within the entire space flEubank fc Hsingl. 120081 ). Since FCCA is inherently regularized, 


the hrst canonical correlation often tends to be too large and its value is difhcult to interpret, as it 
is highly dependent on the value of the regularization parameter. This overhtting problem, which 
also can be vi ewed as a conse q uence of the high-dimensionality of the weight function, was already 
illustrated in iLeurgans et al.l (119931) . who were the hrst to explore penalized FCCA. functional 


canonical correlation regularized by a penalty. Despite the challenge with overhtting FCCA can be 
also employed to implement functional regres sion prob l em by using the canonical weight functions 


Uk, and Vk as bases to expand the regression (IHe et al. 


2000 


20101 ) 


Another difficulty with the versions of FCCA proposed so far is that it requires densely recorded 
functional data so the inner products in (j3]) can be evaluated with high accuracy. Although it is 
possible to impute sparsely observed functional data using the Karhunen-Loeve expansion ([T]) before 
applying any of the canonical correlations, a prediction error will result from such an imputation 
leading to a biased correlation. This bias may be small in practice but hnding an ehective FCCA 
for sparsely observed functional data is still of interest and remains an open problem. 

Other Functional Correlation Measures. The regularization problems for FCCA have moti¬ 
vated the study of alternative notions of functional correlation. These include singular correlation 
and singular expansions of paired processes {X,Y). While the hrst correlation coefficient in FCCA 
can be viewed as Pfcca = sup||„||^||^||^]^ corr((M, X), {v, Y)),, observing that it is the correlation that 
induces the inverse problem, one could simply replace the correlation by covariance, i.e., obtain 
project functions Ui,Vi that attain sup||„||^||.y||^]^ cov((m, X), {v,Y)). Funct ions u ^, v^ turn ou t to be 
the hrst pair of the singular basis of the covariance operator of {X,Y) (lYang et ahl. 1201 if) . This 
motivates to dehne a functional correlation as the hrst singular correlation 


PSCA = 


cov((ui,X), (ui,y)) 
^var((Mi, X)) var((ui, Y)) 


(7) 


Another natural approach that also avoids the inverse problem is to dehne functional correlation 
as the cosine of the angle between functions in L^. For this notion to be a meaningful measure of 
alignment of shapes, one hrst needs to subtract the integrals of the functions, i.e., their projections 
on the constant function 1, which corresponds to a “static part”. Again considering pairs of processes 
{X,Y) = {Xi,X 2 ) and denoting the projections on the constant function 1 by Mk = {Xk, 1), k = 
1,2, the remainder Xk — Mk, A; = 1, 2, is the “dynamic part” for each random function. The L^-angle 
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of these dynamic part can thus serves as a distance measure of them, whi ch leads to a correlatio n 
measure of functional shapes. These ideas can be formalized as follows flDubin fc Miillerl. l2005h . 
Dehning standardized curves either by X^(t) = {Xk(t) — Mk)/{J{Xk{t) — or alternatively 

by also removing /i^ = EXk, X^{t) = {Xk{t) - Mu - {Xk{t) - Mk - , the 

cosine of the angle between the standardized functions is pk^i = E{Xl,Xl). The resulting dynamic 
correlation and other notions of functional correlation can also be extended to obtain a precision 
matrix for functional data. This approach has been developed by lOpgen-Rhein fc Strimmerl (120061) 
for the construction of a graphical networks for gene time course data. 


3.2 Functional Regression 

Functional regression is an active area of research and the approach depends on whether the re¬ 
sponses or covariates are functional or vector data and include combinations of (i) functional re¬ 
sponses with functional covariates, (ii) vector responses with functional covaria^ 
tional responses with vector covariates. An approach for (i) was introduced by 


es, and fiii) func- 


Ramsav fc Dalzell 


fjl99ll) who develo ped the fun c tional linear model (FLM) flT^ for this case, where the basic idea 


already appears in iGrenanderl fll9^ . who derives this as the regression of one Gaussian process 


on another. This model can be viewed as an extension of the traditional multivariate linear model 
that associates vector responses with vector covariates. The topic that has been investigated most 


covariates are functions. Reviews of FLMs are 


Miiller 

(2005, 

2011 

) and a recent review in 

Morris 


(120151) . Nonlinear functional regression models will be discussed in Section 5. In the following we 
give a brief review of the FLM and its variants. 

Functional Regression Models with Scalar Response. The traditional linear model with 
scalar response Y and vector covariate X G TZ^ can be expressed as 


^ — /^o + (X, (3) -|- e. 


( 8 ) 


using the inner product in Euclidean vector space, where (do and jS contain the regression coefficients 
and e is a zero mean hnite variance random error (noise). Replacing the vector X in ([8]) and the 
coefficient vector by a centered functional covariate X^ = X{t) — p{t) and coefficient function 
fd = /5(f), for t E I, one arrives at the functional linear model 


^ — /5o + {X'^, P) + e — (do d- J X'^{t)P{t)dt + e. 


(9) 


which has been studied extensively (jCardot et al.l . Il999l. 


2003 


Hall fc Horowitzl . 12007 ). 


An ad hoc approach is to expand the covariate X and the coefficient function /5 in the same 
functional basis, such as the B-spline basis or eigenbasis in ([1]). Specihcally, consider an orthonormal 
basis (fk, k > 1, oi the function space. Then expanding both X and /5 in this basis leads to 
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X{t) = (3{t) = and model ([9]) is equivalent to the traditional linear 

model ([H]) of the form 


X — j^Q + f3k^k + e, 


( 10 ) 


k=l 


where in implementations the sum on the r.h.s. is replaced by a hnite sum that is truncated at the 
hrst K terms, in analogy to (j2]). 

To obtain consistency for the estimation of the parameter function /9(t), K = Kn i n (ITOil needs 
to in crease with the sample size n. For the theoretical analysis, the method of sieves flGrenanderl . 
198111 can be applied, where the Kth sieve space is the linear subspace spanned by the hrst K = Kn 
components. In addition to the basis-expansion approach, a p e nalize d approach using either P- 
splines or smoothing splines has also been studied (jCardot et ahl. 1200311 . For the special case where 
the basis functions (pk are selected as the eigenfunctions (pk of X, the basis representation approach 
in dH]} is equivalent to conducting a principal component regression albeit with an increasing number 
of principal components. In this case, however, the basis functions are estimated rather than pre- 
specihed, and this adds an additional twist to the theoretical analysis. 

The simple functional linear model ([9]) can be extended to multiple functional covariates 
Xi,..., Xp, also including additional vector covariates Z = [Zi, ..., Zg), where Zi = 1, by 


Y = {z,e) 


E 

i=i 


X^{t)(3j{t)dt + e. 


( 11 ) 


where Ij is the interval where Xj is dehned. In theory, these intervals need not be the same. 
Although model OHi is a straightforward extension of ([9]), its inference is different due to the 
presence of the parametric component 9. A co mbined least sq uares method to estimate 9 and (3j 
simultaneously in a one step or prohle approach flHu et al.l . 1200411 . where one estimates 9 by prohling 
out the nonparametric components (3j, is generally preferred over an alternative back-htting method. 
Once the parameter 9 has been estimated, any approach that is suitable and consistent for htting 
the functional linear model ([9]) can easily be extended to estimate the nonparametric components 
[3k by applying it to the residuals Y — {9, 7 [). 

Extending the linear setting with a single index X'^{t)[3{t)dt to summarize each function 
covariate, a nonlinear link function g can be added in ([9]1 to create a functional generalized linear 
model (either within the exponential family or a quasi-likelihood framework and a suitable variance 
function) 


y = gWo + J X^{t)(3{t)dt) + e. 


( 12 ) 


This model has been con sidered when g is known ( 


2005 


(James. 

2002 ; 

Cardot et ah 

2003: 


Wang et al.l. 120101) and when it is unknown flMuller fc Stadtmiillerl. 


2005 


Cardot fc Sardal. 


Chen et ah 


2011 ah . 


When g is unknown and the variance function plays no role, the special case of a single-index model 
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has further been extended to multiple indices, the number of which is possibly unknown. Such 
“multiple functional index models” typically forgo the additive error structure imposed in (jH]) - 

(na. 

K = 9(Jx‘(t)e,(t)dt,.Jx‘(t)e,{t)dt,e), (13) 

where g is an unknown multivariate function on This line of research follows the paradigm 

of sufficient dimension reduction approach es, which was hrst propose d for vector covariates as an 


off-shoot of sliced inverse regres s ion fS I R.l llDnan fc Li 


1991 


Li 


functional data in ! 

^erre & Yao (20031: 

Eerre & Yao (2005): Cook et al. 

data in 

Jiang et al. 

(2014 

)• 


199lh. which h as been extended to 
20 ini) and to longitudinal 


Functional Regression Models with Functional Response. For a function Y on Jy and a 

single functional covariate X{t), s G Jx, two major models have been considered, 


and 


Y (s) — /3o(s) -|- /3(s)X(s) -|- e(s). 


y(s) = 00 ( 5 ) 4- / a{s,t)X^(t)dt + e{s), 
J Ix 


(14) 


(15) 


where /3o(s) and ao('S) are non-random functions that play the role of functional intercepts, and 
P{s) and a{s,t) are non-random coefficient functions that play the role of functional slopes. 

Model flTTl) implicitly assumes that Ix = ly and is most often referred to as “varying-coefficient” 
model. Given s, F(s) and 2^(s) follow the traditional linear model, but the covariate effects may 
change with time s. This model assumes that the value of Y at time s depends only on the current 
value of (s) and not the history {X{t) : t < s} or future values, hence it is a “concurrent regression 
model”. A simple and effective approach to estimate (3 is to hrst ht model (ITTl) locally in a neighbor¬ 
hood of s using ordinary least square methods to get an initial e stimate dfs). and th en to smooth 


these initial estimates /3(s) across s to get the hnal estimate /3 (IFan fc Zhand. 




19991). In addition 


to such a two-step procedure, one-step smoothing methods have been also studied ( Hoover et ah 


1998 


Wn fc Chiang 


,H; 


Chiang et al. 


200 


as hypothesis testi ng and conhdence bands flWn et al. 


Eggermont et al.l. I 2 OIOI: iHnang et 


1998 


al. 


Huang et ahl . l2004l ). There are also 


2 OO 2 I ). as well 


two review papers fjWu fc Yul . 120021: iFan fc Zhand. 120081) on this to pic. More complex varying coeffi¬ 
cient models include the nested model in iRrnmback fc Ricel (119981) , the co yariate adjusted model in 


Zhu et al. 


(120141 ). among 


Sentiirk fc Miilleil (120051 ). and the multivariate varying-coefficent model in 
others. 

Model (IT^ is generally referred to as functional linear model (FLM), and it differs in crucial 
aspects from the varying coefficient model (ITTl) : At any given time s, the value of F(s) depends on 
the entire trajectory of X. It is a direct extension of traditional linear models with multivariate 
response and vector covariates by changing the inner product from the Euclidean vector space to 
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L^. This model also is a direct extension of model ([9]) when the scalar Y is replaced by Y (s) and the 
coefficient fnnction d var ies with s, leading to a bivariate coefficient snrface. It was first stndied by 
Ramsay fc Dalzelll (11991I1 . who proposed a penalized least sqnares method to estimate the regression 
coefficient snrface P{s,t). When Ix = ly, it is often reasonable to assnme that only the history of 
X affects y, i. e., that 0(s,t) = 0 for s < t. This has been referred to as the “historical fnnctional 
linear model” flMalfait Sz Ramsavl . 120031) . becanse only the history of the covariate is nsed to model 
the response process. This model deserves more attention. 

When X G TZ^ and Y G 7Z‘^ are random vectors, the normal eqnation of the least sqnares 
regression of R on X is cov(X, R) =cov(X, X)/3, where f3 is a p x q matrix. Here a solntion can 
be easily obtained if cov(X, X) is of fnll rank so its inverse exists. An extension of the normal 
eqnation to fnnctional X and R is straightforward by replacing the covariance matrices by their 
corresponding covariance operator. However, an ill-posed problem emerges for the fnnctional normal 
eqnations. Specifically, if for paired processes (X, R) the cross-covariance fnnction is rxy(s,t) = 
cov(X(s), R(t)) and rxx{s,t) = cov(X(s), X(t)) is the anto-covariance fnnction of X, we define 
the linear operator, Rxx : x —)■ x by (Ry x0)( s, t) = f rxx(s, w)/3(w,t)dw. Then a 


“fnnctional normal eqnation” takes the form (IHe et ah 


2000 ) 


rxY — Rxxfd, for (3 G L?{Ix x Ix)- 

Since Rxx is a compact operator in its inverse is not bonnded leading to an ill-posed prob- 


l em. Regn 


( He et ah 


ariza tion is thns needed in analogy to the sitnation for FCCA described in Section 3.1 
200311 . The fnnctional linear model ([9]) is similarly ill-posed bnt not for the varying 


coefficient model fimi becanse the normal eqnation for the varying-coefficient model can be solved 
locally at each time point and does not involve inverting an operator. 

Dne to the ill-posed natnre of the fnnctional linear model, the asymptotic behavior of the 
regression estimators varies in the three design settings. For instance, a ^/n rate is attainable nnder 
the varying-coefficient model flTTll for completely observed fnnctional data or dense fnnctional data 
possibly contaminated with measnrement errors, bnt not for the other two fnnctional linear models 
(19|) and flTSl) nnless the fnnctional data can be expanded by a finite nnmber of basis fnnctions. 
The convergence r ate for (191) depends on how fast the eige nvalnes decay to zero and on regnlarity 
assnmptions on 13 flCai fc Halil2006l:lHall fc Horowitzl . 120071) even when fnnctional data are observed 
continnonsly withont error. An interesting phenomenon is that prediction for model ([9]) follows a 
different paradigm in which y/n convergence is attainable if th e predictor X is s nfficiently smooth 


and the eigenvalnes of predictor processes are well behav ed fICai fc Hali 120061). Estimation for 
(3 and asymptotic theory for model fllSp were explored in Yao et ah fj2005b[) : Ine et all (j201ol) for 
sparse fnnctional data. 

As with scalar responses, both the varying coefficient model flTT|) and fnnctional linear model fflSj) 
can accommodate vector covariates and mnltiple fnnctional covariates. Since each component of the 
vector covariate can be treated as a fnnctional covariate with a constant valne we only discnss the 
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extension to multiple functional covariates, Xi,..., Xp, noting that interaction terms can be added 
as needed. The only change we need to make on the models is to replace the term /3(s)X(s) in ([TT|) 
by the term l3{s,t)X{t)dt in flT^ by //^ l3j{s,t)Xj(t)dt, where Ixj 

is the domain of Xj. If there are many predictors, a variable selection problem may be encountered, 
and when using basis expansions it is natural to employ a group lasso or similar constrained multiple 
variable selection method under sparsity or other suitable assumptions. 

Generalized versions can be developed by adding a pre-specihed link function g in models (imi 
and ([15]). For the case of the varying coefficient model and sparse functional data this has been 


investigated in lgentiirk Sz Muller 
flTbl) and dense functional data in 
hcients for each function. 


20081) for the generalized v arying coefficient model and for model 


James fc Silverman! (120051 ) for a hnite number of expansion coef- 


Jiang fc Wangi (120111 ) considered a setting where the link function may 


vary with time but the /5 in the index does not ch ange vales overti me . The proposed dimension 
reduction approach expands the MAVE method by 


Xia et ah 


(120021) to functional data. 


Random Effects Models. In addition to targeting hxed effects regression, the nonparametric 
modeling of random effects is also of i nterest. One ap p roach is to extend the FPCA approach 
of Se ction 2 to incorporate covariates (ICardot fc Sardal . l2006l: Ijiang et ahl. l2009l: I Jiang fc Wane . 
20101) . These approaches are aiming to incorporate low dimensional projections of covariates to 
alleviate the curse of dimensionality for nonparametric procedures. One scenario where it is easy 
to implement covariate adjusted FPCA is the case where one has functional responses and vector 
covariates. One could conduct a pooled FPCA combining all data as a hrst step and then to use 
the FPCA scores obtained from the hrst stage to m odel covariate effects through a single-index 
model at each FPCA component (jChiou et al.l . 120031) . At this time, such approaches require dense 
functional data, as for sparse data individual FPC scores cannot be estimated consistently. 


4 Clustering and classification of functional data 

Clustering and classihcation are useful tools for traditional multivariate data analysis and are equally 
important yet more challenging in functional data analysis. Clustering aims to group a set of data 
into a conhguration in such a way that data objects within clusters are more similar than across 
clusters with respect to a certain metric. In contrast, classihcation aims to assign an individual 
to a pre-determined group or class based on class-labeled observations. In the terminology of ma¬ 
chine learning, functional data clustering is an unsupervised learning process while functional data 
classihcation is a supervised learning procedure. While clustering aims to identify groups using a 
clustering criterion, classihcation assigns a new data object to a pre-determined group by a discrim¬ 
inant function or a classiher. Functional classihcation typically involves training data containing a 
functional predictor with an associated multi-class label for each data object. The discrimination 
procedure of functional classihcation is closely related to functional clustering, although the goals 
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are different. When cluster centers can be established in functional data clustering, the criteria for 
hnding clusters can also be used for classihcation. Methodology for clustering and classihcation of 
functional data has advanced rapidly during the past decades, due to rising demand for such meth¬ 
ods in data applications. In view of the vast literature on functional clustering and classihcation, 
we focus in the following on only a few typical methods. 

4.1 Clustering of functional data 

For vector-valued multivariate data, hierarchical clustering and the fc-means methods are two clas¬ 
sical and popular approaches. Hierarchical clustering is an algorithmic approach, using either ag- 
glomerative or divisive strategies, that requires a dissimilarity measure between sets of observations 
to decide which clusters should be combined or where a cluster should be split. In the fc-means clus¬ 
tering method, the underlying assumption hinges on cluster centers, the means of the clusters. The 
cluster centers are dehned through algorithms aiming to partition the observations into k clusters 
such that the within-cluster sum of squares, centering around the means, is minimized. Classical 
clustering concepts for vector-valued multivariate data can typically be extended to functional data, 
where various additional considerations arise, such as discrete approximations of distance measures, 
and dimension reduction of the inhnite-dimensional functional data objects. In particular, fc-means 
type clustering algorithms have been widely applied to functional data, and are more popular than 
hierarchical clustering algorithms. It is natural to view cluster mean functions as the cluster centers 
in functional clustering. 

Specihcally, for a sample of functional data {Xj(t); z = 1,..., n}, the /c-means functional cluster¬ 
ing aims to hnd a set of cluster centers {/i^,..., p'^}, assuming there are L clusters, by minimizing 
the sum of the squared distances between {W} and the cluster centers that are associated with their 
cluster labels {C,; i = 1,... ,n}, for a suitable functional distance d. That is, the n observations 
{Xi} are partitioned into L groups such that 

n 

( 16 ) 

n 

2 = 1 

is minimized over all possible sets of functions {pjj; c = 1 ,..., L}, where 

and Nc = l{Ci=c}- Since functional data are discretely recorded, frequently contaminated with 
measurement errors, and can be sparsely or irregularly sampled, a common approach to achieve flTBD 
is to project functional data of inhnite-dimension onto a low dimensional space of a set of basis 
functions, similarly to the implementations of functional correlation and regression. The distance 
d is often chosen as the distance. The following two basic approaches are commonly used in 
functional data clustering. 

Functional Basis Expansion Approach. As described before, once a set of basis functions 
{(pi, (p 2 , • • •} of has been chosen, the hrst K projections Bj. = {X^, ip^), k = 1,K, of the 
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observed trajectories onto the space spanned by the basis fnnctions are used to represent the func¬ 
tional data. An example is the truncated Karhunen-Loeve expansion of Xj in ([2]), where the basis 
is chosen as the eigenfunctions {0i, 02 , • • •} of the auto-covariance operator of the underlying pro¬ 
cess X. Since functional data are realizations of random functions, it is intuitive to consider the 
stochastic structure of random functions to determine the cluster centers as the subspaces spanned 
by the mean and the set of the eigenfunctions, the structure of the random functions, in contrast 
to the /c-means functional clustering that takes the mean functions as cluster centers. Th i s idea 
is irn plemented in the subspace projected functional data clustering approach fjChiou fc Li 120071. 
20081) . Moreover, statistical models can also be used as cluster centers to depict a group of similar 
data, for example by applying mixture models. 

There is a vast amount of the literature on functional data clustering during the past decade, 
including methodological development and a broad range of applications. Some selected approaches 
to be discussed below include the fc-means type of clustering in Section 14.1.11 This is followed in 
Section l4.1.2l bv more details about the subspace projected clustering methods, and in Section 14. 1.31 
by model-based functional clustering approaches. 


4.1.1 Mean functions as cluster centers 

The traditional fc-mean clustering for vector-valued multivariate data has been extended to func¬ 
tional data using the mean function as cluster centers. The approaches can be divided into two 
categories. 

Functional Clustering via Functional Basis Expansion. In the functional basis expansion 
approach, the functional data are projected onto the same set of basis functions irrespective of 
cluster membership, and the sets of basis coefficients {B^i] k = 1,..., K} serve as the proxies of 
individual trajectories. Thus, the distribution patterns of the {Bi^} reflect the clustering patterns 
of the set of functional data. A typical functional clustering approach then is to represent the 
functional data by the coefficients of a basis expansion, this requires carefully choosing a particular 
set of basis functions, and then using available clustering algorithms for multivariate data, such as 
the fc-mean algorithm, to partition the estimated sets of coefficien t s. 


Such t wo stage clustering has been a dopted in 


Abraham et ah 


(120031 ) using B-spline basis func¬ 


tions and ISerban fc Wassermanl (1200511 using Fourier b asis functions coupled with the /c-means 
algorithm, as well as iGarcia-Escudero fc Gordalizal (120051) using B-splines with a robust trimmed k- 
means method. By clustering the fitted sets of coefficients through the fc-means algorithms, 

one obtains the set of cluster centers ..., 5^} on the pr ojected space, and thu s the set of 


cluster centers c = 1,..., L}, where = Yl!k=i 
strong consistency property of this clustering m ethod that has been implemented wit 


Abraham et ah 


( 2 OO 3 I) derived the 


1 various basis 


funct ions, such as P-splines (IGoffev et al 


2010l ). and the wavelet basis (IGiacofci et al 


20 


a Gaussian ortho-normalized basis (iKavano et al. 


20131). 
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Functional Clustering via FPCA. In contrast to the functional basis expansion approach that 
need to choose a particular set of basis functions, the finite approximation FPCA approach ([2]) uses 
data-adaptive basis functions that are determined by the covariance function of the functional data. 
Then the distributions of the sets of FPCs {Aik} indicate different cluster patterns, while the overall 



used a fc-means algorithm on the FPCs as an initial clustering step for the subspace projected 
/c-centers functional clustering algorithm. When the mean functions are the cluster centers, the 
initial step of the approach works reasonably well. However, when the clu ster centers r eflect specific 
features of the structure of the covariance functions, the approach of IChion fc Lil (1200711 to be 
described in the next subsection can further improve the quality of clustering. 


4.1.2 Subspaces as cluster centers. 

Clusters can be defined via subspace projection such that cluster centers lie on sets of basis functions 
of cluster subspaces, rather than mean functions. This idea is particularly sensible in functional 
data clustering by observing that the truncated Karhunen-Loeve representation ([2]) of a random 
function in comprises a fixed component, a mean function, and a random component, a linear 
combination of the eigenfunctions of the covariance operator with weights determined by the FPCs. 
Since each cluster contains a subset of data sampled from random functions in L^, and each subset 
of data lies in a subspace of the structure of the stochastic representation can be used to identify 
clusters of functional data. IChiou fc Lil (120071 ) consider a FPC subspace spanned by a mean function 


and a set of eigenfunctions, and define clusters via FPC subspace projection. The ideas are briefly 
explained as follows. 

Let C be the cluster membership variable, and the FPC subspace = {/r^, (j)1,..., c = 

1,..., L, assuming that there are L clusters. The projected function of Xi onto the FPC subspace 
S'^ can be written as 


K, 




(17) 


k=l 


The sub space-projected k-centers functional clustering procedure (jChiou fc Lil. 120071) aims to find 
the set of cluster centers c = 1,..., K}, such that the best cluster membership of Aj, c*{Xi), is 
determined by minimizing the discrepancy between the projected function Xf and the observation 
Xi such that 

n 

c*{Xi) = argmin ^ d^{Xi, X{). (18) 

In contrast, the /c-means clustering aims to find the set of cluster sample means as the cluster 
centers, rather than the subspaces spanned by c = 1,..., L} as the cluster centers. The initial 
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step of the subspace-projected clustering procedure considers that contains only which re¬ 
duces to the fc-means functional clustering. In the iterative steps, the set of eigenfunctions for each 
cluster is obtained and identihes the set of cluster subspaces {5'^}. The iteration runs until con¬ 
vergence. This functional clustering approach simultaneously identihes the structural components 
of the stochastic representation for each cluster. The idea of the fc-centers function clustering via 
subspace projection was further developed to clustering functional data w ith similar shapes based 
on a shape function model with random scaling effects fjChiou fc Lil. 1200811 . 

More generally, in probabilistic clustering the cluster membership of Xj may be determined by 
maximizing the conditional cluster membership probability given Xj, Pc\x{c \ Xj), such that 


c*{Xi) = argmaxPc'|x(c | Xj). 


(19) 


This criterion requires modeling of the conditional probability Pc\x{' I ■)• can be achieved by a 
generative approach that requires a joint probability mo del or alterna tively through a discriminative 
approach using, for example, a multi-class logit model f Chiou . I 2 OI 2 I. 

For the /c-means type or the fc-centers functional clustering algorithms, the number of clusters is 
pre-determined. The number of clusters for subspace projected functional clustering can be deter¬ 
mined by hnding the maxi mum number of cl usters while retaining signihcant differences between 


pairs of cluster subspaces. 


Li fc Chiou 




developed the forward functional testing procedure 


to identify the total number of clusters under the framework of subspace projected functional data 
clustering. 


4.1.3 Mixture models as the cluster centers 


Model-based clustering flBanheld fc Raftervl. 119931) using mixture models is widely used in cluster¬ 
ing vector-valued multivariate data and has been extended to functional data clustering. Here the 
models of the mixtures underlie the cluster centers. Similarly to the fc-means type of functional 
data clustering, model-based approaches to functional data clusteri ng start by projecting inhnite 
dimensional functional data onto low-dimensional subspaces. E.g., I James fc Sugar! (120031) intro¬ 
duced functional clustering models based on Gaussian mixture distributions for the natural cubic 
spline basis coefficients, with emphasis on clustering sparsely sampled functional data. Similarly, 
Jacques fc Predal (12014 120131) applied the idea of Gaussian mixture modeling to FPCA scores. All 
these methods are based on truncated expansions as in (E]). 

Random effect modeling also provides a model-based clustering approach, that can be based on 


mixed effects mode 


data (ICoffev et al 


s wit h B-splines or P-splines, for example to cluster time-course gene expression 
2 OI 4 I) . For clustering longitudinal data, a linear mixed mo del for clustering usin g 


a penalized normal mixture as random effects distribution has been studied (iHeinzl fc Tutzl. I20l4 . 
Bayesian hierarchical clustering also plays an important role in the development of model-based 
functional clustering, which typically assumes Gaussian mixture distributions on the sets of basis 
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coefficients fitted to individual trajectories. Dirichlet processes are frequentl y used as the prior of 


the mixture dis 


Rodriguez et al. 


ributions and to deal with uncertainty of cluste r numbers (lAugeliui et al.l . l2012l: 


2009 


Petroue et ah 


■ 12009 


Heinzl fc Tut 


I boiaL 


4.2 Classification of functional data 

While functional clustering aims at hnding clusters by minimizing an objective function such as ffTOj) 
and (1T8|) . or more generally, by maximizing the conditional probability as in flT^ . functional clas- 
sihcation assigns a group membership to a new data object with a discriminant function or a 
classiher. Popular approaches for functional data classihcation are based on functional regression 
models that feature the class labels as the response variable and the observed functional data and 
other covariates as the predictors. This view leads to the development of regression based functional 
data classihcation using, for example, functional generalized linear regression models and functional 
multiclass logit models. Similar to approaches of functional data clustering, most functional data 
classihcation methods apply a dimension reduction technique using a truncated expansion in a 
pre-specihed function basis or in the data-adaptive eigenbasis. 


4.2.1 Functional regression for classification 


For regression-base d functional classihcation models, functional generalized linear models (jJamesl . 


2002 


Miillerl . 120051 ) or more specihcally, functional binary regression, such as functional logistic 


regression, are popular approaches. Let {(Zj, W); i = 1,..., n} be a set of random sample, where 
Zj represents a class label, Zj G {1,..., L} for L classes, associated with the observation W. A 
classihcation model for an observation Xq based on functional logistic regression is 
Pr(Z = k\Xo) 


log- 


= 7ofc+ / Xo{t)'yik{t)dt, /c = 1 ,... ,L - 1, 


( 20 ) 


Fr(Z, = L\Xo) JT 

where jok is an intercept term and 7ifc(t) is the coefficient function of the predictor Xo{t) to be 
htted by the sample data. Here, Pr{Zi = L \ Xi) = 1 — Yl!k=i P riZj = k \ X^). This is a fun ctional 
extension of the baseline odds model in multinomial regression flMcCullagh fc Nelderl . 119831) . 

Given a new observation Xq, the model-based Bayes classihcation rule is to choose the class 


label Zq with the maxim a 


pos terior probability among {Pr(Zo = k \ Wq); k = 1 ,...,L}. More 


generally, iLeng fc Miilleii (120061 ) used the generalized functional linear regression model based on 


regression model, several variants of which have been studied 

Araki et al.l. 

2011 ; 

Wang et ah. 

2007; 

Zhu et ah. 

2010 ; 

Rincon & Ruiz-Medina. 

2012 

)• 


2009 


Matsui et al. 


4.2.2 Functional discriminant analysis for classification 

In contrast to the regression-based functional classihcation approach, another popular approach 
is based on the classical linear discriminant analysis method. The basic idea of this approach is 


23 









































































to classify according to the largest conditional probability of the class label variable given a new 
data object by the Bayes rule or classiher. Suppose that the fcth class has prior probability vr^, 
'n,k=i = 1- Given the density of the fcth class, fk, the posterior probability of a new data object 
Xq is given by the Bayes formula, 


Pr{Z = k\Xo) = 




( 21 ) 


Develop ments along these 
sify curves (jjames fc Hastiel . 


ines include a functional linear discriminant analysis approach to clas- 


20011 ) ■ a functional d ata-analytic appr oach to signal discrimination. 


using the FPCA method for dimension reduction flHal 
heat ion rn 


Zhuetal 


es fo r nonparametric curve discrimination (iFerratv fc Vied . l2003l: IChang et al.l. l2014l: 


et al. 


200 lli and ker n el functional classi- 


2OI2I). 


5 Nonlinear Methods for Functional Data 

Due to the complexity of functional data analysis, which blends stochastic process theory, func¬ 
tional analysis, smoothing and multivariate techniques, most research at this point has focused 


gression ( 

Ramsav & Silverman. 

2005; 

Gai & Hall. 

2006; 

Hall Sz Horowitz. 

2007; 

Miiller et ah. 

2008 

Ritz & Streibig. 

20091). Perhaps owing to the success of these linear approaches, the development 


of nonlinear methods has been much slower. However, in many situations linear methods are not 
adequate. A case in point is the presence of time variation or time warping in many data. This 
means that observation time i tself is randomly disto rted and sometimes time variation constitutes 
the main source of variation fjWang fc Gassed . Il997l) . Statistically efficient models will then need 


to rehect the nonlinear features in the data. 


5.1 Nonlinear Regression Models 

The classical functional regression models are linear models with a combination of functional and 
scalar components in predictors and responses. Models with a linear predictor such as the general¬ 
ized functional linear model and single index models also have usually nonlinear link functions and 
their analysis is much more complex than that of the functional linear model. Yet they still main¬ 
tain many similarities with linear functional models. The boundary between linear and nonlinear 
models is thus in flux. 

Due to the increased flexibility of nonlinear and nonparametric models for functional data, 
one needs to walk a hne line in extending functional linear models. For example, there have 
been various developm ents towards fully nonparametric regression models for functional data 
(IFerraty fc Vienl . 120061 ). These models extend the concept of nonparametric smoothing to the 
case of predictor functions, where for scalar responses Y one considers functional predictors X, 
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aiming aX E{Y \ X) = g{X) for a smooth regression function g. Such an approach is motivated 
by extending usual smoothing methods, such as kernel smoothers, by replacing all differences in 
the predictor space by a functional distance, so that the scaled kernel K{^^) with a bandwidth 
h becomes wher e d is a metric i n the predictor space. For a comprehensive review of 

this approach we refer to iFerratv fc VieuI (120061) . Due to the inhnite nature of the predictors, 
in the unrestricted general functional case such models are subject to a serious form of “curse 
of dimensionality”, as the predictors are inherently inhnite-dimensional. Formally, this is due to 
the inhnite-dimensional nature of functional pr e dicto rs and the associated unfavorable small ball 
probabilities in function space flDelaigle fc Hali l2010h . In some cases, when data are clustered in 
lower-dimensional manifolds, the rates of convergence of the lower dimension will likely apply, as is 
the case for nonparametric regression (IBickel fc Lil . 120071) . counteracting this curse. 

To avoid the curse from the start, it is of interest to consider more structured nonparametric 
models, which sensibly balance sufhcient structure with increased hexibility. Structural stability is 
usually satished if one obtains polynomial rates of convergence of the estimated structural compo¬ 
nents and of the predictors. A variety of such models have been studied in recent years. Popular 
extensions of classical linear regression include single or multiple index models, additive models and 
polynomial regression. Analogous exten sions of functional linear regression models have been stud¬ 


ied. Extensions to single index models (jChen et ah 


2011a|) provide enhanced hexibility and struc¬ 


tural stability with usually polynomial rates of convergen ce. Beyond s ingle index models, anothe r 
powerful dimension reduction tool is the adc 

at a 


whic 

1 has been extenc 

:ed to functional c 

2009 

Lai et ah. 

2012; 

Wang et ah 

2014) 


itive model (IStonel, 


Lin fc Zhang 


1999 


1985 


Hastie fc Tibshirani. 1986) 


You fc Zhoul . I2OO7I: 


Carroll et ah 


Wang et al.l. 120141) . In t hese models i t is ass umed that the time ehect is also 


additive, which may be somewhat restrictive. IZhang et ahl (l2013h studied a time-varying additive 
model whose additive components are bivariate functions of time and a covariate. A downside 
is that two-dimensional smoothing is needed for each component and that the covariate ehect is 
entangled with the time ehect. A special case of this model where one assumes that each of the 
additive components i s the product of an unknown time ehect and an unknown covariate ehect 
fjZhang Sz Wangl . l2015l) involves only one-dimensional smoothing and is easy to interpret and im¬ 
plement. Below we describe a simple additive approach that exploits the independence of FPCA 


scores. 


Additive Functional Regression. Various extensions of additive models to additive functional 
models are also of interest. A hrst option is to utilize functional principal components (s) or scores 
Ak as dehned in ([T]) for dimension reduction of the predictor process or processes X, and then 
to assume that the regression relation is additive in these, rather than linear. While the linear 
functional regression model with scalar response can be written as E{Y \ X) = EY + ^kfdk 
(cf. flTUlB with an inhnite sequence of regression coefficients /?*,, the extension to the additive model 
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is the functional additive model 


E{Y \X)=EY 


OO 

E 

fc=i 


fk{^k 


( 22 ) 


where the component fu nctions are required to be smooth and to satisfy E{fk{Ak)) = 0 flMiiher fc Yaol. 


2008 


Sood et al 


2009|). 


This model can be characterized as frequency-additive. A key feature that makes this model not 
only easy to implement but also accessible to asymptotic analysis even when considering inhnitely 
many predictor components, i.e., the entire inhnite-dimensional predictor process, is a consequence 
of the observation that (with /iy = EY) 


E{Y - /iy I Ak) = E{E{Y - /iy |X) I A,} = f^{A,) \ AJ = MA^), 

i=i 


(23) 


if the functional principal components are assumed to be independent. In this case, simple one¬ 
dimensional smoothing o f the responses agains t the FPCA scores leads to consistent estimates of the 
component functions fk (IMuller fc Yaol . 120081) . A similar phenomenon applies to functional linear 
model in that E(Y — /iy | Ak) = jdkAk, because of the uncorrelatedness of the FPCA scores of the 
predictor processes. A consequence of this is that a functional linear regression can 
into a sequence of inhnitely many simple linear regressions fjChiou fc Miillerl. 


a can 

e decomposed 

. 2997: 

Muller et ah. 


200911 . 


Projections on a hnite number of directions for each of potentially many predictor functions 
that are guided by the relationship between predictors and responses provide an alternative addi¬ 
tive approach that, while ignoring the inhnite dimensional nature of the predictors, is practically 
promising since the projections are formed by taking into consideration the relation between X and 
Y, in contrast to other functional regression models where the predictors are fo r med merely based - 


on the auto-cova r iance structure of predictor processes X (1 James fc Silvermanl . 12005 : 


2nila : 


Fan et al 


Chen et al. 


29141 . 


Still other forms of additive models have been considered for functional data. While model 
can be characterized as frequency-additive, as it is additive in the FPCs, one may ask the 
question whether there are time-additive models. It is immediately clear that since the number of 
time points on an interval domain is uncountable, an unrestricted time-additive model E{Y \ X) = 
'^teio T] feasible. One can resolve this conundrum by assuming that the functions 

ft are smoothly varying in t. Then considering a sequence of time-additive models on increasingly 
dense hnite grids of size p. 


w(g) = ^/j(2f(ij)), 

i=i 


E(y|A'(«,) 

assuming that fj{x) = g(tj,x) for a smooth bivariate function g, leads in the limit p ^ oo to the 
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continuously additive model flMiiller et al.l. 120131) 


E(y|A')= lim [ 


[o,r] 


g{t,X{t)) dt. 


(24) 


This model can be imple mented with a bivaria te spline representation of the function g] it was 
discovered independently by McLean et ahl ( 2014 ). Nonlinear or linear models where individual 
predictor times are better predictors than functional principal components, i.e., regression models 
with time-based rather than frequency-based predictors, have also been considered. These models 
can be viewed as special cases of the continuously additive model fl241) in that o nly a few time 


and their associated additive functions fj{X{tj)) are assumed to be predictive flFerratv et ahl. l2010l) 


poim 


Optimization and Gradients With Functional Predictors. In some applications one may 
wish to maximize the response E{Y \ X) m terms of features of the predictor function X. Examples 
where this is relevant include the evolution of life history trajectories such as reproductive trajec¬ 
tories X in medflies. Maximization of lifetime reproduction Y provides an evolutionary advantage 
but must be gauged against mortality, which cuts off further reproduction and is known to rise 
under strong early reproduction through the “cost of reproduction”. Generally, the outcome Y is 
a characteristic to be maximized. Therefore, gradients in terms of functional predictors X are of 
interest. Extending the functional additive model, one can introduce additive gradient operators 
with arguments in at each predictor level X = {Ai, ^ 42 ,...}, 


r?(M) = '^fk\Ak) [ (l)k{t)u{t)dt, ueL^ 


k=l 


These additive gradient operators then serve to find dir ections in whi c 
enabling a maximal descent algorithm in function space flMiiller fc Yad . 


(25) 


1 resp onses increase, thus 


2nina[) . 


Polynomial Functional Regression. Finally, just as the common linear model can be embedded 


in a more general polynomial versio n, a polynomi a 


linear model has been developed in lYao Sz Miillerl (120lOl) . with quadratic functional regression as 


functional model that extends the functional 


the most prominent social case. With centered predictor processes X^, this model can be written 
as 


E(Y\X)^a + j l3(t)X‘(t)dt + j j "f(s,t)X‘(s)X‘(t)dsdt, 


(26) 


and in addition to the parameter function f3 that it shares with the functional linear model it also 
features a parameter surface 7 . The extension to higher order polynomials is obvious. These models 
can be equivalently represented as polynomials in the corresponding FPCs. A natural question is 
whether the linear model is sufficient or needs to be extended to a model that includes a quadratic 
term. A corresponding test was developed bv iHorvath et al.l (120131) . 
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5.2 Time Warping, Dynamics and Manifold Learning for Functional 
Data 

In addition to amplitnde variation, many fnnctional data are best described by assnming that 
additional time variation is present, i.e, the time axis is distorted by a smooth random process. A 
classical example are growth data. In human growth, the biological age of different children varies 
and this variation has a direct bearing on the growth rate that generally follows similar shapes but 
with subject-specihc timing. 

Time Variation and Cnrve Registration. If both amplitude and time variation are jointly 
present, they cannot be separately identihed, so additional assumptions that break the non- 
identihability are crucial if one wishes to identify and separate these two components, which jointly 
generate the observed variation in the data. An important consequence of the presence of time 
warping is that it renders the cross-sectional mean function inefficient and uninterpretable, because 
if functions have important features such as peaks at different times, ignoring the differences in 
timing when taking a cross-sectional mean will distort these features. Then the mean curve will 
not resemble a i iv of t he sample curves and is not useful as a representative for the sample of curves 
( Ramsay fc Li . 1998 1. 


E arly approaches to time- warped functional data included dynamic time warping fjSakoe fc Chibal. 


1978 


Lawton fc 


SvlvestrX 


1971 


Wang fc Gasser. 1199711 for the registration of speech and self-modeling nonlinear regression 


Kneip fc Gasserl . 1198811 . where in the simplest case one assumes that the 


observed random functions can be modeled as shift-scale family of an unknown template function, 
where shift and scale are subject-specihc random variables. Another traditional method to deal with 
time warping in functional data, which is also referred to as the registration or alignment problem, 
is the landmark method. In this approach special features such as peak locations in functions or 
derivatives are aligned to their average location and then smooth transform ations from the average 
location to the 


o catio n of the feature for a specihc subject are introduced (IKneip fc Gasserl . 


1992 


Gasser fc Kneipl . 1199511 . If well-expressed features are present in all sample curves, the landmark 


method serves as a gold standard for curve alignment. However, landmark alignment requires that 
all landmarks are present and identihable in all sample curves. This is often not the case for noisily 
recorded functional data. Landmarks may also be genuinely missing in some sample functions due 
to stochastic variation. 

The mapping of latent bivariate time warping and amplitude processes into random functions 
can be studied systematically, leading to the dehnition of the mean curve as the funct ion that corre¬ 
spon ds to the bivariate Frechet mean of both time warping and amplitude processes (ILiu fc Muller . 
20041) and this can be exemplihed with a simple approach of dehning time warping functions by 
relative area-under-the curve. Recent approa ches include alignment of function by means of min- 
i mizing a Fisher-Rao rnetric (IWu et al.l. 120141) . alignment of event data by dynamic ti me warping 


flArribas-Gil fc Miillerl. 120141) . and time warping in house price boom and bust modeling fjPeng et ah 














































2ni4h . 


Pairwise Warping. As a specific example of how a warping approach can be developed, we discuss 
a pairwise warping approach that is based on the idea that all relevant information a bout time warp¬ 
ing r esides in pairwise comparisons and the resulting pairwise relative time warps (ITang fc MiiHer . 


20081) . Starting with a sample of n i.i.d. smooth observed curves Yi,Y2, ...,Yn (with suitable modi- 


hcations for situations where the curves are not directly observed but only noisy measurements of 
the curves at a grid of discrete time points are available) we postulate that 


= (€lo,r], 


(27) 


where the Xi are i.i.d. random functions that represent amplitude variation and the hi are the real¬ 
izations of a time warping process h that yields warping functions that represent time variation, are 
strictly monotone and invertible and satisfy hj(0) = 0, hj(T) = T. The time warping functions map 
time onto warped time and since time flows forward only, have to be strictly monotone increasing, 
A recent approach to warping that allows time to flow backwards with possibly declining warping 
functions as well has been applied to housing prices where declines correspond to reversing time 
f Peng et ah . 2014 ). 

To break the non-identihability, iTang fc Mullerl (120081) (from which the following descriptions 
are taken) make the assumptions that the overall curve variation is (at least asymptotically) dom¬ 
inated by time variation, i.e., Xi{t) = fi{t) + 6Zi{t), where 6 vanishes for increasing sample size n, 
the Zi are realizations of a smooth square integrable process and E{h{t)} = t, for t G [0,1]. Then 
warping functions may be represented in a suitable basis that ensures monotonicity and has asso¬ 
ciated random coefficients in the expansion, for example monotonically restricted piecewise linear 
functions. If curve Yi has the associated time warping function hi then the warping function gn. 
that transforms the time scale of curve Yi towards that of Y^ is gikit) = hj{h.^^(f)}, and analogously, 
the pairwise-warping function of curve Y^ towards Yi is gkiif) = hk{h~^{t)}. 

Because warping functions are assumed to have average identify, A'[hj{h^^(f)}|/ifc] = h^^{t), 
and, as gik{t) = hi{h^^{t)}, we hud that h^^{t) = E{gik{t)\hk}, which motivates corresponding 
estimators by plugging in estimates of the pairwise warping functions. This shows that under 
certain regularity assumptions the relevant warping information is indeed contained in the pairwise 
time warpings. 

Promising recent extensions of warping approaches aim at formulating ioint models for ampli¬ 
tude and time variation or for combina t ions o f regression and time variation flKneip fc Ramsavl . 

I 


2008 


Gervinil. 


2015 


Hadiipantelis et al.l . l2015l ). Adopting a joint perspective may lead to better 


interpretability in language warping or better performance in functional regression in the presence 
of warping. 

Functional Manifold Learning. A comprehensive approach to time warping and other nonlinear 
features of functional data such as scale or scale-shift families that simultaneously handles amplitude 
and time warping features is available through manifold learning. A motivation for the use of 
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functional manifold models is t 
a manifold (IDonoho fc Grimed. 


lat im age data that are dominated by random domain shifts he on 


20051) . Similar warping mo dels where the warping corresponds to a 


1995 


Leng fc Miillerl. 120061) . 


random time shift have been studied for functional data (jSilvermanl . 

Such data have low-dimensional representations in a transformed space but are inhnite-dimensional 
in the traditional functional basis expansion including the eigenbasis expansion ([1]). While these 
expansions will always converge in under minimal conditions, in these scenarios they lead to 
an inefficient functional representation in contrast to representations that take advantage of the 
manifold structure. 

When functional data include time warping or otherwise lie on a nonlinear low-dimensional 
manifold that is situated within the ambient inhnite-dimensional functional Hilbert space, desirable 
low-dimensional representations are possible through manifold learning and the resulting nonlin¬ 
ear representations are particularly useful for subsequent statistical analysis. Once a map from 
an underlying low-dimensional vector space into functional space has been determined, this gives 
the desired manifold representatio n. Nonlinear dimension reduction methods, such as locally 


embedding flRoweis &: Saull. l2000li. isometric mapping with Isomap (iTenenbaum et al.l. 120001) and 


mear 


Laplacian eigenmaps ( 

Belkin & Nivoeri. 2003) 

have been successfully applied to image data and are 

particularly useful for 

dme-warped functional data, and also for samples of random density func- 

tions (Kneip & Utikah 

2001 ; 

Zhane: & Muller 

. 2011) or other forms of functional data that contain 

nonlinear structure. In terms of diagnostics, indicators for the presence of functional manifolds are 


plots of FPCs against other FPCs that exhibit “horseshoe” or other curved shapes. 

Among the various manifold learning methods. Isomap can be easily implemented and has been 
shown to be a useful and versatile method for functional data analysis. Specihcally, a modihed 
Isomap learning algorithm that includes a penalty to the empirical geodesic distances to correct for 
noisy data, and employing local smoothing to map data from the manifold into functional space 
has been shown to provide a flexible and b roadly applicab l e app roach to low-dimensional manifold 
modeling of time-warped functional data (jChen fc Miillerl. 120121) . This approach targets “simple” 
functional manifolds M in that are “flat”, i.e., isomorphic to a subspace of Euclidean space, such 
as a Hilbert space version of the “Swiss Roll”. An essential input for Isomap is the distance between 
functional data. A default distance is the distance, but this distance is not always feasible, for 
example when the functional data are only sparsely s ampled. In such cases , the distance needs 
to be replaced by a distance that adjusts to sparsity fjPeng fc Miillerl . 120081) . 

The manifold M is characterized by a coordinate map —> Ad C such that (f is 

bijective, and both (f, are continuous and isometric. For a random function X the mean p in 
the d-dimensional representation space and the manifold mean in the functional space are 
characterized by 

^ = E{ip-\X)}, p^=(p-^(/i). 

The isometry of the map (p implies that the manifold mean is uniquely dehned. 

In addition to obtaining a mean, a second basic task in FDA is to quantify variation. In analogy 
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to the modes of var iation that are available through eigenfunctions and FPCA fjCastro et ahl. Il986l: 
Jones fc Ricel. 119921) . one can define manifold modes of variation 




(y. G 


where the vectors G j = 1,... ,d, are the eigenvectors of the covariance matrix of 
G i.e., cov( 9 ?“^(X)) = ■ Here > ■ ■ ■ > X^ are the corre¬ 

sponding eigenvalues and the modes are represented by varying the scaling factors a. 

Each random function X G AJ then has a unique representation in terms of the d—dimensional 
vector -d = (•di,..., idd) G 


X = +dj = {if ^(X) j = 


i=i 


where (•, •) is the inner product in a nd d,- are uncorrelate d r.v.s with mean 0 and variance 
A^, the functional manifold components (jChen fc Miillerl. 120121) . This representation is a genuine 
dimension reduction of the functional data to the hnite dimension d while the Karhunen-Loeve 
representation in case of functional data that are on a nonlinear manifold in most cases will require 
an inhnite number of components. 


Learning Dynamics Prom Functional Data. Since functional data consist of repeated obser¬ 
vation of (usually) time-dynamic processes, they allow to determine the dynamics of the underlying 
processes. Dynamics are typically assessed with derivatives, and under some regularity conditions 
derivatives X' of square integrable processes X are also square integrable and from the eigenrepre- 
sentation dU) (or representation in another functional basis) one obtains 


= + (28) 

k=l 


where n is the order of derivative. Derivatives of p can be estimated with suitable smoothing 
methods and those of 0 by partial differentiation of covariance surfaces, which is even possible in 
the case of sparsely sampled data where direct differentiation of trajectories would not be possible 
( Liu fc Muller ■ 2009 1. 

For the case where one has differentiable Gaussian processes, since X and X' are jointly Gaussian, 
it is easy to see that flMiiller fc Yaol. l2010bl) 


A-W(() - f,<‘>(() = + Z(t), m = P9) 

var{X(f)| 

This is a linear differential equation with a time-varying function fi{t) and a drift process Z. Here 
Z is a Gaussian process such that Z{t), X{t) are independent at each t. If Z is relatively small, the 
equation is dominated by the linear part and the function (d. Then the behavior of fd characterizes 
different dynamics, where one can distinguish dynamic regression to the mean for those t where 
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(3{t) < 0 and explosive behavior for those t where f3{t) > 0. In the hrst case, deviations of X{t) from 
the mean function fi{t) will diminish, while in the second case they will be increase: An individual 
with a value X (t) above the mean will tend to move even higher above the mean under the explosive 
regimen but will move closer to the mean under dynamic regression to the mean. Thus the function 
(3 that is estimated from the observed functional data embodies the empirical dynamics that can 
be learned from the observed sample of Gaussian random trajectories. 

A nonlinear version of dynamics learning can be developed for the case of non-Gaussian processes 
f Verzelen et ahl. 2012 b This is of interest whenever linear dynamics is not applicable. Nonlinear 
dynamics learning is based on the fact that one always has a function / such that 


E{X'{t) I X{t)} = /{f, X{t)}, X'{t) = f{t, X{t)} + Z{t) , (30) 

with E{Z{t) I X{t)} = 0 almost surely. Generally the function / will be unknown. It can be 
consistently estimated from the observed functional data by nonparametrically regressing deriva¬ 
tives X' against levels X and time t. This can be implemented with simple smoothing methods. 
The dynamics of the processes is then jointly determined by the function / and the drift process 
Z. Nonlinear dynamics learning is of interest to understand the characteristics of the underlying 
stochastic system and can also be used to determine whether individual trajectories are “on track”, 
for example in applications to growth curves. 


6 Outlook and Future Perspectives 


FDA has grown from a methodology with a relatively narrow focus on a sample of fully observed 
functions to encompass other statistical areas that were considered separate. Its applicability is 
steadily growing and now includes much of longitudinal data analysis, providing a rich nonpara- 
metric methodology for a held that has been dominated by parametric random effects models for 
a long time. Of special interest are recent developments in the interface of high-dimensional and 
functional data. There are various aspects to this interface: Gombining functional elements with 
high-dimensional covariates, such as predictor times with in an interva l havin g an individual predic¬ 
tor effect that goes beyond the functional linear model flKneip et ahl. l201lh . or selecting arbitrary 
subsets of functional principal component scores in regression models. 

Another inter: 


2010 


Ghen et af 


ace pf h igh-dimensional and functional data is the method of Stringing flWu fc Miillerl. 


2011 b|), which uses a uni- or multi-dimensional scaling step to order predictors 


along locations on an interval or low-dimensional domain and then assigns the value of the respec¬ 
tive predictor to the location of the predictor on the interval, for all predictors. The distance of 
the predictor locations on the interval matches as closely as possible a distance measure between 
predictors that can be derived from correlations. Gombining locations and predictor values and 
potentially also adding a smoothing step then converts the high-dimensional data for each sub¬ 
ject or item to a random function. These functions can be summarized through their FPG scores. 
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leading to an effective dimension reduction that is not based on sparsity and that works well for 
strongly correlated predictors. These functions can serve as predictors in the framework of one of 
the above described functional regression models. Thus, FDA approaches can take advantage of 
the high-diemensional setting of the data and turns it into an advantage. 

Several open problems were mentioned in Section 2 including the choice of the number of compo¬ 
nents K needed for the approximation Q of the full Karhunen-Loeve expansion ([T]) and the choice 
of the tuning parameters involved in the smoothing steps of estimation. Another less-developed 
area in FDA is outlier detections and robust FDA approaches. In general, approaches for sparse 
functional data are still lagging behind those for dense functional data. 

Many recent developments in FDA have not been covered in this review. These include functional 
designs and domain selection problems and also depen dent functional data such as functional time 
series, with many recent interesting developments, e.g. iPanaretos fc Tavakolil (120131) . Another area 
that has gained recent interest are multivariate functional data. Similarly, in some longitudinal 
studies one observes for each subject repeatedly observed and therefore dependent functional data 
rather than scalars. There is also recently rising i nterest in spatially ind e xed f unctional data. These 
problems pose novel challenges for data analysis flHorvath fc Kokoszkal . 120121) . 

While this review has focused on concepts and not on applications. As for other growing 
statistical areas, a driving force of recent developments in FDA has been the appearance of new types 
of data that require adequate methodology for their analysis. This is leading to “next generation” 
functional data that include more complex features than the first generation functional data that 
have been the emphasis of this review. Examples of recent applications include continuous tracking 
and monitoring of health and movements, temporal gene expression trajectories, transcription factor 
count modeling along the genome, and the analysis of auction data, volatility and other hnancial 
data with functional methods. Last but not least, brain imaging data are intrinsically functional 
data and there is an accelerated interest in the neuroimaging community to analyze neuroimaging 
data with the FDA approach. A separate entry (by John Aston) in this issue deals specihcally with 
this application area. 
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